FPGA memory controller links embedded µP to cache-enhanced DRAM

James Joseph, Ramtron International Corp, and Charles Brown, Intel Corp 



Twice as fast as most standard DRAMs, an "enhanced DRAM" (EDRAM) combines as many as 4 Mbits of 35-nsec DRAM with a 2-kbit, 15-nsec SRAM cache on one chip (Fig 1). Because of the built-in cache, some varieties of this chip can complete zero-wait-state read cycles at clock rates as high as 50 MHz, without interleaving memory banks.

Interleaving allows clock speeds as high as 100 MHz. 

Their internal caches give these memory devices some unusual properties compared with conventional DRAM. First, as long as the system is reading from or writing to the device's cache, DRAM refresh can proceed in the background without interrupting memory accesses. Second, and more subtly, the devices internally latch the row-enable (/RE) signal. Thus, the memory controller can deassert the external /RE signal and begin precharging for the next memory access earlier than it could with conventional DRAM.

System sketch 

To take advantage of EDRAMs, you need a controller tailored to the memory's particular operations. Fortunately, you can make such a controller from a single field-programmable gate array (FPGA), although you must select the FPGA carefully to meet the memory subsystem's tight timing constraints. For example, you can build a 32-bit RISC system having as much as 16 Mbytes of memory using EDRAM SIMMs and one FPGA controller. Such a controller can access 4 to 16 Mbytes of EDRAM without incurring wait states.

The listings this article mentions are available on the EDN Readers' BBS. Phone (617) 558-4241 with modem setting 1200/2400 8,N,1 (9600 baud = (617) 558-4580). From the Main System Menu, enter ss/freeware. Then, from the /freeware SIG menu, enter rkms818.

DRAM chips often lack the performance that embedded systems require. Also, SRAM takes up too much space and is too expensive for any application using over 1 Mbyte of memory. But designers have another option: a memory built from DRAM enhanced with an on-chip cache.

To see how a one-chip memory controller can meet practical size, performance, and cost specifications, consider an embedded system built around a 33-MHz i960CF µP (Fig 2). This 32-bit RISC µP incorporates a 4-kbyte instruction cache, a 1-kbyte data cache, and 1 kbyte of data memory. Its external nonmultiplexed 32-bit address and data buses allow burst-read and -write data transfers to external memory at rates as high as 132 Mbytes/sec.
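The 132-Mbyte/sec figure is simply one 32-bit (4-byte) word per 33-MHz clock. The short sketch below, a simplified model rather than anything from the article's listings, makes that arithmetic explicit and, under the assumption that each wait state costs one extra clock per word, shows why the design works so hard to avoid wait states. The function name is hypothetical.

```python
def burst_bandwidth_mbytes(clock_mhz: float, bus_bytes: int = 4, wait_states: int = 0) -> float:
    """Sustained burst rate in Mbytes/sec.

    Simplified model (an assumption): one bus word per clock, and each wait
    state adds one extra clock to every word transferred.
    """
    clocks_per_word = 1 + wait_states
    return clock_mhz * bus_bytes / clocks_per_word


print(burst_bandwidth_mbytes(33))                 # the i960CF figure quoted above
print(burst_bandwidth_mbytes(33, wait_states=1))  # same bus with one wait state per word
```

Under this model a single wait state per word halves the usable bandwidth, which is the motivation for the zero-wait-state controller described below.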

In this design, as many as 16 Mbytes of EDRAM fit into two 72-pin SIMM sockets, taking up 3 in.² of board space. The 33-MHz i960 needs 15-nsec EDRAM chips. For lower cost, a 25-MHz µP works with less expensive 20-nsec chips.

Figs 3, 4, and 5 show a worst-case timing analysis performed with Chronology's Timing Designer software, using timing parameters entered from the data books for the design's components. Figs 3 and 4 show the timing waveforms for the design's read and write cycles. The labels on the vertical axes of Figs 3 and 4 and the third column of Fig 5 match the nomenclature of each device's maker. The devices in this design are derated to allow them to drive added load capacitance.

(Ed Note: Although we tried to make the symbols in the article match the symbols in the devices' literature, one exception remains: the inversion operator, a leading slash ("/"), in the listing. Except for PALASM, programmable-device programming languages' inversion operators react to signals defined as active-low in eccentric and unwelcome ways. Consequently, signals defined as active-low by a leading slash or a trailing number sign in their literature (for example, /BLAST or RDY#) appear with a trailing underscore (for example, BLAST_ or RDY_) in the listing. In the listing, a leading slash simply and unequivocally inverts the logic variable it precedes (for example, /BLAST_).)

The timing analyses might be hard to interpret if you do not know that the i960 µP does not distinguish between burst and single-cycle memory operations. Some µPs' status lines indicate one condition for burst memory operations and another for single-cycle memory operations. The i960, by contrast, performs only burst memory operations but happily allows "bursts" of length one, which are in effect single-cycle operations. Thus, you need examine only the burst memory operations that the figures present.

Critical timing

The most critical timing requirements that the system's memory controller must meet are a 6-nsec max clock-to-output delay and a 6.5-nsec data-setup time. Achieving this performance is difficult for most complex PLDs (CPLDs) and FPGAs. An Intel iFX780-10 FPGA can meet these requirements.

This 132-pin FPGA is an 80-macrocell device that features a fixed 10-nsec propagation delay between all input (or I/O) and output pins. Internal registers' synchronous setup times are not more than 6.5 nsec, and clock-to-output delays are as short as 6 nsec.

For the µP to take advantage of the memory's high speeds, the memory controller must handle address multiplexing, address decoding, and refresh cycles correspondingly swiftly.

The controller must also perform functions that let the µP and memory communicate, including cache-hit/miss comparisons. The memory internally performs its own cache-hit/miss comparisons, but it does not externally signal a cache miss. Therefore, the controller registers the upper half of the address it read last. The controller then compares that address with the current read address to deduce whether the memory will read from its cache or its DRAM array. If the read operation accesses the memory's DRAM array, the controller commands the µP to wait.

Table 2: Controller-to-memory signals

Controller      Memory
MAL0 to 9       Multiplexed address, banks 0, 1
MAH0 to 9       Multiplexed address, banks 2, 3
MALA10          Bank 0, MA10
MALB10          Bank 1, MA10
MAHA10          Bank 2, MA10
MAHB10          Bank 3, MA10
/RE0 to 3       Row enables for banks 0 to 3
/CAL0 to 3      Column-address latch inputs for bytes 0 to 3
W/R0, 1         Write/read-mode input for low/high banks
/F0, 1          Refresh-mode input for low/high banks
/S0 to 3        Chip selects for banks 0 to 3
/G0, 1          Output enable for low/high banks
/WE0, 1         Write enable for low/high banks

In short, using a simple single-phase clock, the controller must supervise all of the µP's memory transactions while incurring a minimum of memory wait states.

These memory transactions include one- to four-word (32-bit) read operations and 1- to 4-byte, word, and long-word write operations.

For each of these transactions, the FPGA's 6-nsec clock-to-output delay and 6.5-nsec data-setup time allow zero-wait-state, noninterleaved operation.

Figure 1

Twice as fast as most standard DRAMs, an "enhanced DRAM" (EDRAM) combines up to 4 Mbits of 35-nsec DRAM and a 2-kbit, 15-nsec SRAM cache on one chip. In addition, the device's DRAM-to-cache bus is 256 bytes wide.
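A quick way to screen candidate programmable devices is to compare their register timing against the two critical limits quoted under "Critical timing." The sketch below does that comparison with the article's figures; the `RegisterTiming` type, the helper names, and the "slower part" numbers are hypothetical, and a real check would of course use the full worst-case analysis of Figs 3 to 5 rather than two numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterTiming:
    clock_to_output_ns: float  # max delay from clock edge to a registered output
    setup_ns: float            # data-setup time a register needs before the clock edge


# Limits quoted in the article for zero-wait-state, noninterleaved operation.
REQUIRED = RegisterTiming(clock_to_output_ns=6.0, setup_ns=6.5)


def meets_requirements(device: RegisterTiming, required: RegisterTiming = REQUIRED) -> bool:
    """True if the device is at least as fast as the requirement on both parameters."""
    return (device.clock_to_output_ns <= required.clock_to_output_ns
            and device.setup_ns <= required.setup_ns)


def clock_period_ns(clock_mhz: float) -> float:
    return 1000.0 / clock_mhz


ifx780_10 = RegisterTiming(clock_to_output_ns=6.0, setup_ns=6.5)    # figures from the text
slower_part = RegisterTiming(clock_to_output_ns=8.0, setup_ns=7.0)  # hypothetical device
print(meets_requirements(ifx780_10), meets_requirements(slower_part))
print(round(clock_period_ns(33), 1))  # clock period at 33 MHz, in nsec
```

The iFX780-10 passes with no margin at all, which is why the article stresses careful FPGA selection.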



Dealing out the loads

A fast FPGA, however, is not enough. Achieving top performance requires that the design divide address and control loading between two address buses, each driving as many as two memory banks of 8 Mbytes each. Similarly, the design splits heavily loaded signals that drive both memory banks, such as W/R, /F, /WE, and /G, into two parts.

In contrast, the /CAL0 through /CAL3 signal lines are lightly loaded and can, therefore, drive all four memory banks from one pin. Tables 1 and 2 show the µP's address and control signals that feed the memory controller as well as the memory signals the controller must generate.
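The load split can be summarized as a lookup of which controller outputs an access to a given bank drives, using the pin names of Table 2. This is a minimal sketch; the function and key names are hypothetical, and it assumes that the "low" half of the split signals (W/R0, /F0, /G0, /WE0) serves banks 0 and 1 while the "high" half serves banks 2 and 3, matching the MAL/MAH address-bus split.

```python
def pins_for_bank(bank: int) -> dict[str, str]:
    """Controller outputs involved in an access to one EDRAM bank (names per Table 2)."""
    if bank not in range(4):
        raise ValueError("banks are numbered 0 to 3")
    half = bank // 2  # assumption: banks 0, 1 are the low half; banks 2, 3 the high half
    return {
        "address": "MAL0-9" if half == 0 else "MAH0-9",          # shared per address bus
        "ma10": ("MALA10", "MALB10", "MAHA10", "MAHB10")[bank],  # one MA10 pin per bank
        "row_enable": f"/RE{bank}",
        "chip_select": f"/S{bank}",
        "write_read": f"W/R{half}",
        "refresh": f"/F{half}",
        "output_enable": f"/G{half}",
        "write_enable": f"/WE{half}",
        "column_latch": "/CAL0-3",  # lightly loaded: one set drives all four banks
    }


for bank in range(4):
    p = pins_for_bank(bank)
    print(bank, p["address"], p["ma10"], p["write_enable"])
```

Each heavily loaded output thus sees at most two banks, while the lightly loaded /CAL lines are shared by all four.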



Table 1: Controller-to-processor signals

Signal       Function
A2 to 31     Address bus
BE#0 to 3    Byte enables
ADS#         Address strobe
W/R#         Write/read mode
BLAST#       Burst last
RDY#         Non-burst-ready acknowledge
BTERM#       Burst-ready acknowledge
RESET#       Processor-reset input
PCLK         Processor's clock output



Figure 2

This embedded system uses a 33-MHz Intel i960CF RISC µP. The design's 16 Mbytes of EDRAM fit into two 72-pin SIMM sockets, taking up 3 in.² of board space.

Around the blocks

The controller's main blocks (Fig 6) comprise a state machine (Fig 7), an optional refresh counter, an address decoder, an address multiplexer, and a burst-address generator. Listing 1 (see pg 136) contains the equations for configuring the state-machine portion of the controller, the most complex section. Dittos indicate repetitious sections of the listing, which are cut to save space. You can find the full listing on the EDN Readers' BBS, along with a version of the memory controller's program for an 80486, as MS818 on the /freeware Special Interest Group. Using the i960 and the 80486 versions for reference, you should be able to adapt the design to other µPs.

State machine's core

At the heart of the controller is the state machine, which executes the memory's control sequences (Fig 7). When the system issues a reset command, the controller initializes its logic. Then, for as long as the system holds its reset line low, the controller performs only refresh cycles.

Many designers provide a signal specifically for memory refreshing. If an external 16-kHz refresh signal is not available, the FPGA has enough spare logic to divide a clock signal down internally.

Listing 1 and Fig 7 show that, when the state machine clears the memory's refresh request by entering the F2 state, the controller does not perform the next refresh until the memory again asserts REF_. In this way, the controller refreshes the memory every 62.5 µsec.

Initializing the memory is simple. Upon power-up, the memory requires only two read-miss cycles per bank for initial resetting. The start-up routine in the system's bootstrap ROM usually executes these cycles. Then, when the system releases the reset line, the controller enters its idle state and waits for memory transactions. Subsequently, a series of at least eight refresh cycles resets the memory's logic, but not its contents.

Memory sequences

When the controller's address decoder detects a valid memory address, the controller enables the memory for memory transactions. In this design, the memory uses only the lower 16 Mbytes of the address range. Normally, other memory and I/O devices would map into the remaining address space.
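Because the EDRAM occupies only the lower 16 Mbytes, the decoder's "valid address" test amounts to checking that A24 to A31 are all zero. The sketch below shows that test; the bank decode is purely hypothetical (four equal 4-Mbyte banks selected by A22 and A23), since the article does not say which address bits select a bank, and all names are illustrative.

```python
EDRAM_BYTES = 16 * 1024 * 1024  # the memory occupies the lower 16 Mbytes
BANK_BYTES = EDRAM_BYTES // 4   # hypothetical: four equal 4-Mbyte banks


def is_edram_address(address: int) -> bool:
    """Valid EDRAM address: A24 to A31 all zero (lower 16 Mbytes of the 32-bit space)."""
    return 0 <= address < EDRAM_BYTES


def bank_of(address: int) -> int:
    """Hypothetical bank decode on A22 and A23; the article does not give this mapping."""
    if not is_edram_address(address):
        raise ValueError("address is outside the EDRAM range")
    return address // BANK_BYTES


print(is_edram_address(0x00FF_FFFC), is_edram_address(0x0100_0000))
print(bank_of(0x0040_0000), bank_of(0x00C0_0010))
```

Addresses at or above 16 Mbytes fall through to whatever other memory and I/O devices the system maps there.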

Under control of the state machine, the controller's address multiplexer selects the row, column, or burst address it will apply to the memory's multiplexed address inputs. Again, to limit the capacitive load that the controller must drive, the design divides MA0 to 10 between two independent output buses. Using multiple outputs limits the clock-to-output delay to 8 nsec, allowing zero-wait-state operation.

The µP's burst-mode transfers can move one to four words. Consequently, the controller's burst generator toggles the lower two multiplexed address bits, MA0 and MA1, appropriately. These bits, along with bits MA2 to 8, form the memory's column address.
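The burst generator's job can be sketched as producing one column address per burst word, holding MA2 to 8 fixed and stepping only MA0 and MA1. This is a minimal model, not the article's equations; it assumes the burst steps upward through the low two bits and stays within an aligned four-word block, and the function name is hypothetical.

```python
COLUMN_BITS = 9  # column address is MA0 to MA8


def burst_columns(start_column: int, words: int) -> list[int]:
    """Column addresses for a burst of one to four 32-bit words."""
    if not 1 <= words <= 4:
        raise ValueError("a burst moves one to four words")
    if not 0 <= start_column < (1 << COLUMN_BITS):
        raise ValueError("column address must fit in MA0 to MA8")
    upper = start_column & ~0b11  # MA2 to MA8 stay fixed for the whole burst
    first = start_column & 0b11   # starting value of MA0, MA1
    return [upper | ((first + i) & 0b11) for i in range(words)]


print([hex(c) for c in burst_columns(0x104, 4)])
```

Keeping the generator to two bits is what lets it update the address within the tight clock-to-output budget on every RB or WB state.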

A memory sequence starts when the address-strobe signal (ADS#) and a valid address are present. The exact control sequence that the state machine executes and the series of bus events that transpires depend on the state of the µP's write/read output, that is, on whether the µP is requesting a memory-read or -write operation.



Figure 3 
Timing Analysis - Read Burst Cycle



Figure 4 
Read operations

In the case of a read, the controller must fetch one to four 32-bit words from the memory. These data can be in either the cache or the EDRAM's internal DRAM array. The controller registers the upper 16 bits of the previous read address. By comparing this registered address with the corresponding upper bits of the current address, the controller can detect read misses. That is, because of the size of the memory's cache relative to the size of its DRAM, if the upper 16 bits of the address do not change from one read to the next, the data are in the cache.
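The read-miss rule above can be modeled as a one-register comparator. The sketch follows the article's rule literally, treating addresses as 32-bit i960 byte addresses and comparing their upper 16 bits; the class and method names are hypothetical, and the assumption that nothing is registered after reset (so the first read is always a miss) is ours.

```python
from typing import Optional


class ReadMissDetector:
    """Models the controller's read-miss check as the article describes it.

    The upper 16 bits of each read address are registered; the next read is
    a cache hit only if its upper 16 bits match the registered value.
    """

    def __init__(self) -> None:
        # Assumption: nothing is registered after reset, so the first read is a miss.
        self.last_upper: Optional[int] = None

    def is_miss(self, address: int) -> bool:
        upper = (address >> 16) & 0xFFFF  # upper half of a 32-bit address
        miss = upper != self.last_upper
        self.last_upper = upper           # register this read's upper bits
        return miss


detector = ReadMissDetector()
for addr in (0x0012_3400, 0x0012_3404, 0x0013_0000):
    print(hex(addr), "miss" if detector.is_miss(addr) else "hit")
```

On a miss, the controller would hold off the µP's acknowledgment (inserting wait states) while the DRAM row loads into the cache.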

During a memory read, the state machine starts with a transition to state R1. There, the controller presents row-address and chip-select signals to the memory and puts the memory in its read mode.

For the controller to access a row of DRAM within the EDRAM chip, the controller must clock the row-enable signal of the selected memory bank 13 to 18 nsec after the start of state R1. Once accessed, the DRAM row transfers its contents internally to the SRAM cache.

As this timing analysis shows, the system µP can read successive 32-bit words in a burst without incurring wait states.

Timing Analysis - Write Burst Cycle

Somewhat similar to the read-burst sequence in Fig 3, write-burst sequences transfer all but the first of a burst of 32-bit words with no wait states.

In the next state, RB, the controller presents the column-address and output-enable signals to the cache memory and returns RDY# and BTERM# acknowledgment signals to the µP, confirming that the controller has accessed the data. All control signals are available within 6 to 8 nsec after the µP clock's rising edge.

The state machine cycles through state RB one to four times, depending on how many words the µP wants to transfer in a burst. During each new RB state, the controller sends a new burst address to the memory. During this time, the output-enable, chip-select, and BTERM# signals all remain valid.

Background refresh cycle

If a refresh request is pending when a burst sequence begins, the state machine lets the memory execute an internal refresh operation during cache-read cycles, taking advantage of the EDRAM's architecture. Such a background refresh operation starts at the second RB state and continues through the third and fourth RB states. During the second and third RB states, the controller brings the /F control signal low as the state machine clocks the memory's row-enable lines.

The memory can complete a background refresh operation during a four-word burst-mode transfer. For shorter burst-mode transfers, however, the background refresh operation will still be in progress when the transfer ends. In this case, the state machine jumps to state F2, which allows the refresh operation to finish before the controller returns to its idle state.
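The read path through the state machine, including the background-refresh detour, can be summarized as a state sequence. This sketch uses the state names R1, RB, and F2 from the text; "IDLE" is a hypothetical label for the idle state, and treating every burst shorter than four words (including a one-word burst) the same way is our assumption, since the article does not spell out the one-word case.

```python
def read_burst_states(words: int, refresh_pending: bool) -> list[str]:
    """State sequence for one read burst, following the article's description.

    R1 presents the row address; RB runs once per word. A pending refresh runs
    in the background from the second RB (/F low during the second and third
    RB). Only a four-word burst gives it time to finish; otherwise F2 lets it
    complete before the return to idle.
    """
    if not 1 <= words <= 4:
        raise ValueError("a burst moves one to four words")
    states = ["R1"] + ["RB"] * words
    if refresh_pending and words < 4:
        states.append("F2")  # refresh still in progress when the burst ends
    states.append("IDLE")    # hypothetical name for the idle state
    return states


print(read_burst_states(4, refresh_pending=True))
print(read_burst_states(2, refresh_pending=True))
```

The payoff of the EDRAM architecture shows in the four-word case: the refresh costs no extra states at all.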

Like read operations, write operations can happen in one- to four-word bursts. Further, the µP can specify which bytes to overwrite in a 32-bit word: the µP's four byte-enable lines determine which bytes to overwrite in memory.
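The effect of the byte enables on a 32-bit word can be shown with a small model. It assumes BE#0 to BE#3 are active low (a 0 enables that byte), that byte lane 0 is the least significant byte, and that each enabled lane maps to the matching /CAL0 to /CAL3 input listed in Table 2; the function names are hypothetical.

```python
def enabled_lanes(be_n: int) -> list[int]:
    """Byte lanes selected by BE#0 to BE#3 (bit n = level of BE#n; 0 = enabled)."""
    return [lane for lane in range(4) if not (be_n >> lane) & 1]


def merge_write(old_word: int, new_word: int, be_n: int) -> int:
    """Word left in memory after a write that updates only the enabled bytes."""
    result = old_word
    for lane in enabled_lanes(be_n):
        mask = 0xFF << (8 * lane)  # lane n is latched by /CALn
        result = (result & ~mask) | (new_word & mask)
    return result & 0xFFFFFFFF


print(enabled_lanes(0b1100))                          # lanes 0 and 1 enabled
print(hex(merge_write(0x11223344, 0xAABBCCDD, 0b1100)))
```

Because each byte lane has its own column-address-latch input, the controller can implement partial-word writes simply by activating only the matching /CAL lines.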

The state machine first advances to state W1. To enable the requested memory locations, the state machine activates the appropriate column-address-latch inputs in the memory bank. Next, the state machine selects the corresponding row-address and chip-select signals for that bank and activates the memory's write mode. The state machine also sends RDY# and BTERM# acknowledgments to the µP. The controller issues a row-enable signal between 13 and 18 nsec after the start of state W1.

Advancing to the next state, WB, the state machine issues the column address and, to latch the forthcoming data, a write-enable signal. The column-address-latch signals leave the controller 13 to 18 nsec after the start of state WB.



Figure 5 
Along with Figs 3 and 4, this worst-case timing analysis, performed with Chronology's Timing Designer software, uses timing parameters for the design components. The third column matches the nomenclature of each device's maker.
As long as the µP keeps its burst-last signal, BLAST#, high, the µP will continue to issue data in the burst; it asserts BLAST# (drives it low) with the last word. Depending on the state of BLAST#, the state machine will repeat WB as many as three more times. During each pass through the WB state, the controller generates a new column address. It also activates the memory's write-enable and column-address-latch inputs to write the incoming data to memory.

The last word 

After writing the last word, the controller ends the burst-write operation by deactivating the memory's write-enable, column-address-latch, and row-enable inputs.
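The write path can be summarized the same way as the read path: W1 once, then WB once per word until BLAST# marks the last word. This sketch assumes BLAST# is sampled once per WB state and caps the burst at four words; the function name and the "IDLE" label are hypothetical.

```python
def write_burst_states(blast_n_samples: list[int]) -> list[str]:
    """State sequence for one write burst.

    blast_n_samples[i] is the level of BLAST# during the i-th WB state
    (1 = high, more words follow; 0 = asserted, this is the last word).
    W1 is entered once; WB runs once per word, at most four times.
    """
    if not blast_n_samples:
        raise ValueError("a burst carries at least one word")
    states = ["W1"]
    for level in blast_n_samples[:4]:
        states.append("WB")
        if level == 0:  # BLAST# asserted: last word of the burst
            break
    # After the last word the controller deactivates /WE, /CAL, and /RE.
    states.append("IDLE")  # hypothetical name for the idle state
    return states


print(write_burst_states([1, 1, 0]))  # three-word burst
print(write_burst_states([0]))        # single-word "burst"
```

The single-word case is the i960's "burst of length one" mentioned earlier, so one state sequence covers every write the µP issues.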

Applications for a fast, 32-bit embedded system such as this one include network bridges and routers, disk-array controllers, laser printers, telecommunication switches, video servers, and image processors.
Figure 6

A memory controller tailored to EDRAM's operations needs a refresh counter, an address multiplexer, a burst-address generator, an address decoder, and a read/write/refresh state machine. The controller must accept and acknowledge certain commands from the system µP as well as provide properly partitioned commands and data to the memory.

Figure 7

This state machine is the heart of the memory-controller FPGA. Starting from reset, the state machine supervises all the memory's control sequences.



